library(ISLR); library(data.table); library(ggplot2); library(dplyr)
data("Default")
dim(Default)
## [1] 10000 4
head(Default)
## default student balance income
## 1 No No 729.5265 44361.625
## 2 No Yes 817.1804 12106.135
## 3 No No 1073.5492 31767.139
## 4 No No 529.2506 35704.494
## 5 No No 785.6559 38463.496
## 6 No Yes 919.5885 7491.559
dt <- data.table(Default)
dt[, .N, by = default]
## default N
## 1: No 9667
## 2: Yes 333
ggplot(Default, aes(x = balance, y = income, color = default)) + geom_point()
ggplot(Default, aes(x = factor(default), y = balance)) + geom_boxplot(aes(fill = factor(default)))
It can be inferred from the plots that those with a high balance tend to default on their credit card while those with a low balance do not. The incomes of those defaulting and those not defaulting are in the same range.
Any time a straight line is fit to a binary response that is coded as 0 or 1, in principle we can always predict p(X) < 0 for some values of X and p(X) > 1 for others (unless the range of X is limited).
To avoid this problem, we must model p(X) using a function that gives outputs between 0 and 1 for all values of X.
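A minimal sketch of why the logistic function solves this: the sigmoid maps any real input into (0, 1).

```r
# The logistic (sigmoid) function squeezes any real number into (0, 1),
# so fitted probabilities can never fall below 0 or above 1.
sigmoid <- function(z) 1 / (1 + exp(-z))
z <- c(-100, -5, 0, 5, 100)
round(sigmoid(z), 4)  # → 0.0000 0.0067 0.5000 0.9933 1.0000
```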
In logistic regression, a one-unit increase in a predictor changes the log-odds of the response by that predictor's coefficient (equivalently, it multiplies the odds by exp(coefficient)).
logreg <- glm(default~balance, data = Default, family = binomial)
summary(logreg)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.651330614 0.3611573721 -29.49221 3.623124e-191
## balance 0.005498917 0.0002203702 24.95309 1.976602e-137
The z-statistic above plays the same role as the t-statistic in linear regression.
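Using the estimates printed above (hard-coded here so the sketch stands alone), the log-odds interpretation can be checked numerically: exp of the balance coefficient is the odds multiplier per extra dollar of balance.

```r
# Coefficients copied from the summary output above
b0 <- -10.651331
b1 <- 0.005499
exp(b1)  # ≈ 1.0055: each extra dollar of balance raises the odds of default by ~0.55%
# Predicted default probabilities at two balance levels
p <- function(x) 1 / (1 + exp(-(b0 + b1 * x)))
round(c(p(1000), p(2000)), 3)  # low risk at $1000, ≈ 0.586 at $2000
```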
str(Smarket)
## 'data.frame': 1250 obs. of 9 variables:
## $ Year : num 2001 2001 2001 2001 2001 ...
## $ Lag1 : num 0.381 0.959 1.032 -0.623 0.614 ...
## $ Lag2 : num -0.192 0.381 0.959 1.032 -0.623 ...
## $ Lag3 : num -2.624 -0.192 0.381 0.959 1.032 ...
## $ Lag4 : num -1.055 -2.624 -0.192 0.381 0.959 ...
## $ Lag5 : num 5.01 -1.055 -2.624 -0.192 0.381 ...
## $ Volume : num 1.19 1.3 1.41 1.28 1.21 ...
## $ Today : num 0.959 1.032 -0.623 0.614 0.213 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 2 2 1 2 2 2 1 2 2 2 ...
head(Smarket)
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
## 1 2001 0.381 -0.192 -2.624 -1.055 5.010 1.1913 0.959 Up
## 2 2001 0.959 0.381 -0.192 -2.624 -1.055 1.2965 1.032 Up
## 3 2001 1.032 0.959 0.381 -0.192 -2.624 1.4112 -0.623 Down
## 4 2001 -0.623 1.032 0.959 0.381 -0.192 1.2760 0.614 Up
## 5 2001 0.614 -0.623 1.032 0.959 0.381 1.2057 0.213 Up
## 6 2001 0.213 0.614 -0.623 1.032 0.959 1.3491 1.392 Up
library(GGally)
##
## Attaching package: 'GGally'
##
## The following object is masked from 'package:dplyr':
##
## nasa
select <- dplyr::select
ggpairs(data = Smarket, aes(color = Direction))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
cor(select(Smarket, -9))
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 0.029699649 0.030596422 0.033194581 0.035688718
## Lag1 0.02969965 1.000000000 -0.026294328 -0.010803402 -0.002985911
## Lag2 0.03059642 -0.026294328 1.000000000 -0.025896670 -0.010853533
## Lag3 0.03319458 -0.010803402 -0.025896670 1.000000000 -0.024051036
## Lag4 0.03568872 -0.002985911 -0.010853533 -0.024051036 1.000000000
## Lag5 0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641
## Volume 0.53900647 0.040909908 -0.043383215 -0.041823686 -0.048414246
## Today 0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527
## Lag5 Volume Today
## Year 0.029787995 0.53900647 0.030095229
## Lag1 -0.005674606 0.04090991 -0.026155045
## Lag2 -0.003557949 -0.04338321 -0.010250033
## Lag3 -0.018808338 -0.04182369 -0.002447647
## Lag4 -0.027083641 -0.04841425 -0.006899527
## Lag5 1.000000000 -0.02200231 -0.034860083
## Volume -0.022002315 1.00000000 0.014591823
## Today -0.034860083 0.01459182 1.000000000
Smarket %>% group_by(Year) %>% summarize(sum(Volume))
## Source: local data frame [5 x 2]
##
## Year sum(Volume)
## 1 2001 296.9218
## 2 2002 359.9697
## 3 2003 348.9427
## 4 2004 358.8879
## 5 2005 483.1591
Smarket %>% group_by(Year) %>% summarize(sum1 = sum(Volume)) %>% ggplot(aes(x = Year, y = sum1)) + geom_line(col = "red") + geom_point(col = "blue")
glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, family = "binomial", data = Smarket)
glm.predict <- predict(glm.fit, type = "response")
glm.predict <- data.frame(glm.predict)
glm.predict <- glm.predict %>% mutate(Direction = ifelse(glm.predict > 0.5, "Up", "Down"))
table(glm.predict$Direction, Smarket$Direction)
##
## Down Up
## Down 145 141
## Up 457 507
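From the confusion matrix above (counts hard-coded here), the overall training accuracy works out to just over 52%:

```r
# rows = predicted direction, columns = actual direction (from the table above)
cm <- matrix(c(145, 457, 141, 507), nrow = 2,
             dimnames = list(predicted = c("Down", "Up"),
                             actual    = c("Down", "Up")))
accuracy <- sum(diag(cm)) / sum(cm)
accuracy  # (145 + 507) / 1250 = 0.5216, barely better than a coin flip
```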
When the observations are drawn from a Gaussian distribution with a common covariance matrix, LDA is preferred over logistic regression. If these conditions are not met, logistic regression gives better results.
KNN makes no assumptions about the decision boundary. Hence, when the boundary is highly nonlinear, KNN outperforms the other two, but it gives no information about which variables are more important than others.
QDA is not as flexible as KNN but can be an effective compromise between KNN and LDA/logistic regression.
When the decision boundaries are linear, approaches like LDA and logistic regression perform at the same level, whereas in a moderately nonlinear case QDA may give better results. For a non-parametric approach, KNN can outperform the other methods.
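A minimal sketch of this comparison on simulated data (two Gaussian classes with a common covariance, the setting in which LDA is preferred); the class means and sample sizes below are made up for illustration.

```r
library(MASS)  # lda() and qda() ship with base R as a recommended package
set.seed(1)
n <- 200
x <- rbind(matrix(rnorm(n * 2), ncol = 2),             # class A: mean 0
           matrix(rnorm(n * 2, mean = 1.5), ncol = 2)) # class B: mean 1.5
y <- factor(rep(c("A", "B"), each = n))
lda.fit <- lda(x, grouping = y)  # one pooled covariance matrix
qda.fit <- qda(x, grouping = y)  # a separate covariance per class
mean(predict(lda.fit, x)$class == y)  # training accuracy, LDA
mean(predict(qda.fit, x)$class == y)  # training accuracy, QDA
```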
LDA is based on Gaussian densities. The variances are assumed equal in each class and hence cancel in the discriminant function, which is why it is called Linear Discriminant Analysis.
If there is a variable that separates the classes perfectly, logistic regression becomes unstable: the parameter estimates tend toward infinity. Logistic regression was developed largely by biologists and people working in the medical field, where such perfectly separating variables rarely occur.
When the distributions of the predictors X are normal, LDA is more stable than logistic regression.
Both QDA and LDA break down with a large number of variables, and in that case naive Bayes becomes more attractive.
We cannot use LDA when the number of variables is very high.
If there are K classes to classify, a (K-1)-dimensional plot can be used to distinguish them. When there are more than 2 classes we can find the best two-dimensional plot to represent them, but we will have to compromise on the magnitude of error.
An observation is assigned to the class with the largest probability; that class is regarded as the winning unit.
ggpairs(iris, aes(color = Species, alpha = 0.8))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Training errors are always less than test errors; in short, they are overly optimistic because there is a chance of overfitting. If the training set is small, the chances of overfitting are quite high.
FALSE POSITIVES: negative examples that are classified as positive.
FALSE NEGATIVES: positive examples that are classified as negative.
We can change the classification by changing the probability threshold. To reduce the false-negative rate we decrease the threshold to 0.1 or less.
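A sketch of the threshold effect on simulated probabilities (the data here are hypothetical): lowering the cutoff from 0.5 towards 0.1 shrinks the false-negative rate at the cost of more false positives.

```r
set.seed(42)
truth <- rbinom(1000, 1, 0.1)  # 1 = positive class (e.g., default)
# positives get somewhat higher scores than negatives
prob <- ifelse(truth == 1, rbeta(1000, 2, 3), rbeta(1000, 1, 6))
fn_rate <- function(thresh) mean(prob[truth == 1] < thresh)
sapply(c(0.5, 0.3, 0.1), fn_rate)  # false-negative rate falls as the threshold drops
```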
In classification you can tolerate some bias if you get good classification and much less variance in return; hence naive Bayes is useful in classification.
MDA is for classification with more than three classes.
MDA analyzes patterns and projects them onto a two-dimensional subspace that can give a better separation of the classes. The idea is to reduce the dimensions with minimum loss of information. There are two important aims of MDA.
Both Multiple Discriminant Analysis (MDA) and Principal Component Analysis (PCA) are linear transformation methods and closely related to each other. In PCA, we are interested in finding the directions (components) that maximize the variance in our dataset, whereas in MDA, we are additionally interested in finding the directions that maximize the separation (or discrimination) between different classes (for example, in pattern classification problems where our dataset consists of multiple classes), in contrast to PCA, which ignores the class labels.
In other words, via PCA we are projecting the entire set of data (without class labels) onto a different subspace, while in MDA we are trying to determine a suitable subspace to distinguish between patterns that belong to different classes. Roughly speaking, in PCA we are trying to find the axes with maximum variance where the data is most spread (within a class, since PCA treats the whole dataset as one class), and in MDA we are additionally maximizing the spread between classes.
The key concepts remain the same irrespective of the response being qualitative or quantitative.
The training error is too optimistic: the harder we fit the data, the lower the training error gets, but the test data will show a lot of errors.
These resampling methods give us the standard deviation and bias of the estimates.
The training error does not tell us anything about overfitting because it is computed on the same data. The more parameters, the better it looks.
The ingredients of the prediction error are BIAS & VARIANCE.
When we don't fit too hard, the bias is high and the variance is low, as the number of parameters is small; but as we move towards the right, the bias goes down because the model can adapt to more of the subtleties in the data, while the variance goes up because there are more parameters.
The point at which the prediction error is minimum is the tradeoff point, and this is referred to as the bias-variance tradeoff.
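The tradeoff can be seen in miniature by fitting polynomials of increasing degree to simulated data (the true curve and noise level below are arbitrary choices): training MSE only ever falls, while test MSE is U-shaped.

```r
set.seed(1)
f <- function(x) sin(2 * x)                                   # the (unknown) true function
x  <- runif(100, -2, 2); y  <- f(x)  + rnorm(100, sd = 0.5)   # training set
xt <- runif(100, -2, 2); yt <- f(xt) + rnorm(100, sd = 0.5)   # test set
mse <- sapply(1:10, function(d) {
  fit <- lm(y ~ poly(x, d))
  c(train = mean(resid(fit)^2),
    test  = mean((yt - predict(fit, newdata = data.frame(x = xt)))^2))
})
round(mse, 3)  # training error decreases monotonically; test error bottoms out
```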
Depending on the split, the error rate varies by a large extent. This confirms that the split plays a major role in determining the error.
Validation can be used for two insights: * which order of polynomial is best [how good is the model?] * how good the error is at the end of the fitting process
In LOOCV, \(h_i\) indicates how much influence an observation has on its own fit. If it has a large influence, \(h_i\) inflates the residual by dividing it by \(1 - h_i\).
LOOCV does not shake up the data much: each training set looks like the others, and the average of errors that are highly correlated has high variance. Hence it is suggested to use K-fold CV with K = 5 or 10.
Picking K is itself a bias-variance tradeoff.
With K = 10 there is not much variation in the MSE across different splits, so it does a good job compared to a single validation split.
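A sketch of 10-fold CV with `cv.glm()` from the boot package (a recommended package shipped with R), choosing a polynomial degree on simulated data:

```r
library(boot)  # provides cv.glm()
set.seed(1)
x <- runif(200, -2, 2)
y <- sin(2 * x) + rnorm(200, sd = 0.5)
dat <- data.frame(x, y)
cv.err <- sapply(1:6, function(d) {
  fit <- glm(y ~ poly(x, d), data = dat)   # gaussian glm = ordinary regression
  cv.glm(dat, fit, K = 10)$delta[1]        # 10-fold CV estimate of test MSE
})
which.min(cv.err)  # the degree with the smallest estimated test error
```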
The SE of an estimator is the SD of its sampling distribution. If you are able to recompute the estimate many times, the SD of those estimates is the SE.
Confidence interval: if we were to perform the experiment many times, the confidence interval would contain the true value of the estimate 95% of the time.
The bootstrap percentile interval is the simplest way of constructing a confidence interval from the bootstrap.
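The percentile interval simply takes the empirical quantiles of the bootstrap replicates; a minimal base-R sketch for the mean of a skewed sample:

```r
set.seed(1)
x <- rexp(100)                                   # a skewed sample
boots <- replicate(2000, mean(sample(x, replace = TRUE)))
quantile(boots, c(0.025, 0.975))                 # 95% percentile interval
```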
In cross-validation there is no overlap between the validation and training sets, which is crucial for its success.
In the bootstrap, if each sample is treated as a training dataset and validated on the original dataset, there is a significant amount of overlap: about two-thirds of the observations are generally shared.
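The two-thirds figure can be checked by simulation: the expected fraction of distinct original observations in a bootstrap sample is \(1 - (1 - 1/n)^n\), which approaches \(1 - 1/e \approx 0.632\).

```r
set.seed(1)
n <- 1000
# fraction of distinct original observations appearing in each bootstrap sample
frac <- replicate(500, length(unique(sample(n, replace = TRUE))) / n)
mean(frac)  # close to 1 - 1/e ≈ 0.632
```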
library(ISLR); library(leaps); library(ggplot2)
data(Hitters)
subsetfit <- regsubsets(Salary ~ ., data = Hitters)
summary(subsetfit)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters)
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " "
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " "
## 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*"
## CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) "*" " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " "*" " " " " " "
## 4 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 5 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 6 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 7 ( 1 ) " " " " " " "*" "*" " " " " " "
## 8 ( 1 ) " " "*" " " "*" "*" " " " " " "
subsetfit <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
names(summary(subsetfit))
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
k <- summary(subsetfit)
Linear splines: piecewise linear models, continuous at each knot.
Cubic splines: piecewise cubic polynomials with continuous derivatives up to order 2 at each knot. The truncated power functions are raised to the power 3; the idea is the same as for linear splines, but cubic.
Natural splines: the function is made linear beyond the boundaries, i.e., two extra constraints are imposed at each boundary rather than leaving the boundary regions unconstrained (for values of x smaller than the smallest knot and larger than the largest).
For the same number of degrees of freedom in the model, a natural spline allows extra knots.
A cubic spline with K knots has K + 4 degrees of freedom, whereas a natural spline with K knots has K degrees of freedom.
Natural cubic splines produce more stable estimates and have narrower confidence intervals than cubic splines.
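The basis sizes can be checked with the splines package (shipped with base R); note that `bs()` and `ns()` count only interior knots, and the intercept adds one more df to the column counts below.

```r
library(splines)
x <- seq(0, 1, length.out = 200)
knots <- c(0.25, 0.5, 0.75)          # K = 3 interior knots
ncol(bs(x, knots = knots))           # cubic spline basis: K + 3 = 6 columns
ncol(ns(x, knots = knots))           # natural spline basis: K + 1 = 4 columns
```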
The point of a smoothing spline is to fit a function that minimizes the RSS and is smooth, i.e., not woefully overfit.
We do not have to worry about the knots.
We add the squared second derivative, integrated over the whole domain, to the RSS. This second-derivative term constrains the functions over which we search to be smooth.
It searches out the wiggles in the function, i.e., it adds up all the nonlinearity, and lambda, the tuning parameter (also called the roughness penalty), is greater than zero.
The smaller the lambda, the lower the penalty and the more wiggly the function can be.
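`smooth.spline()` illustrates this: generalized cross-validation picks the roughness penalty automatically, so no knots are placed by hand (the data below are simulated).

```r
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)
fit <- smooth.spline(x, y)   # lambda chosen automatically by (G)CV
fit$df                       # effective degrees of freedom the penalty allowed
fit$lambda                   # the chosen roughness penalty
```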
require(ISLR)
data(Wage)
fit1 <- lm(wage ~ poly(age, 4), data = Wage)
summary(fit1)
##
## Call:
## lm(formula = wage ~ poly(age, 4), data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98.707 -24.626 -4.993 15.217 203.693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.7036 0.7287 153.283 < 2e-16 ***
## poly(age, 4)1 447.0679 39.9148 11.201 < 2e-16 ***
## poly(age, 4)2 -478.3158 39.9148 -11.983 < 2e-16 ***
## poly(age, 4)3 125.5217 39.9148 3.145 0.00168 **
## poly(age, 4)4 -77.9112 39.9148 -1.952 0.05104 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.91 on 2995 degrees of freedom
## Multiple R-squared: 0.08626, Adjusted R-squared: 0.08504
## F-statistic: 70.69 on 4 and 2995 DF, p-value: < 2.2e-16
# Using raw rather than orthogonal polynomial values
fit2 <- lm(wage ~ poly(age, 4, raw = TRUE), data = Wage)
summary(fit2)
##
## Call:
## lm(formula = wage ~ poly(age, 4, raw = T), data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98.707 -24.626 -4.993 15.217 203.693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.842e+02 6.004e+01 -3.067 0.002180 **
## poly(age, 4, raw = T)1 2.125e+01 5.887e+00 3.609 0.000312 ***
## poly(age, 4, raw = T)2 -5.639e-01 2.061e-01 -2.736 0.006261 **
## poly(age, 4, raw = T)3 6.811e-03 3.066e-03 2.221 0.026398 *
## poly(age, 4, raw = T)4 -3.204e-05 1.641e-05 -1.952 0.051039 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.91 on 2995 degrees of freedom
## Multiple R-squared: 0.08626, Adjusted R-squared: 0.08504
## F-statistic: 70.69 on 4 and 2995 DF, p-value: < 2.2e-16
# Both fit2a and fit2b are the same as fit2
fit2a <- lm(wage ~ age + I(age^2) + I(age^3) + I(age^4), data = Wage)
fit2b <- lm(wage ~ cbind(age, age^2, age^3, age^4), data = Wage)
agelims <- range(Wage$age)
age.grid <- seq(from = agelims[1], to = agelims[2])
preds <- predict(fit1, newdata = list(age = age.grid), se = TRUE)
se.bands <- data.frame(cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit))
head(se.bands)
## X1 X2
## 1 62.52799 41.33491
## 2 67.23827 49.75522
## 3 71.75608 57.38768
## 4 76.09436 64.27111
## 5 80.26558 70.44323
## 6 84.27774 75.94470
g <- ggplot() + geom_point(data = Wage, aes(age, wage)) +xlim(18,80)
g
g2 <- g + geom_line(aes(x = age.grid, y = preds$fit, color = "red"), size = 1.5)
g2
g3 <- g2 + geom_line(data = se.bands, aes(x = age.grid, y = se.bands[,1]), linetype = 2, color = "blue") + geom_line(data = se.bands, aes(x = age.grid, y = se.bands[,2]), linetype = 2, color = "blue")
g3 + theme_classic()
SVMs present an alternative to the widely used logistic regression and linear regression: a linear classifier that maximizes the distance between the classes involved. Maximizing the distance between the two classes implies that new data is more likely to be classified correctly.
Find the plane that separates the features and classes; this is one of the best ways to do classification.
This is a classic example of how computer scientists approach the problem: there is no probability model, just a hyperplane that separates the classes.
A hyperplane is a linear equation set equal to zero.
A hyperplane in p dimensions (one per variable) is a flat affine subspace of dimension p - 1. If there are two variables, the hyperplane is a line.
The beta vector (the coefficient vector excluding the intercept) is called the normal vector; it is perpendicular to the hyperplane.
We project the points orthogonally onto the normal and measure the distance from the origin to the point of projection. The points on the hyperplane are all at the same distance, given by the intercept.
If the direction vectors are unit vectors, the distance we are measuring is the Euclidean distance.
If we have more than one classifier that does a good job, which is the best one to choose? We use the concept of the maximal margin classifier.
The reasoning behind this concept is that if the gap to the training data is large, the classifier will tend to maintain that gap on the test data as well.
We often cannot fit a separating hyperplane through the data because the classes overlap. When N is far greater than the number of variables, we typically cannot create a separating hyperplane.
In cases where the number of observations is less than the number of variables (especially in genomics), a separating linear hyperplane can usually be found.
If there are noisy data points, i.e., the entire model can be affected by the presence of a single point, then even though a separating hyperplane can be created, it will be a poor maximal margin classifier.
In cases like these we can relax the criterion of the maximal margin classifier and maximize something called a soft margin.
We extend the margin beyond the least distance and call this a soft margin. This is similar to regularization: by extending the margin, it is determined by more than just the closest points.
Each point is allowed a slack from its margin, and the total slack is held within a budget. Subject to this budget we maximize the margin. Here C is the tuning parameter, just like lambda in regularization. As we change C we get different margins.
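The slack-within-a-budget idea can be made concrete for a toy linear classifier (the coefficients and points below are made up, not a fitted SVM): each point's slack is how far it falls short of its margin, and the solution is feasible only if the total slack stays within the budget.

```r
beta <- c(1, 1); b0 <- -1                     # a candidate separating line
X <- rbind(c(2, 2), c(0.4, 0.4), c(-1, -1))   # three observations
y <- c(1, 1, -1)                              # their class labels (+1 / -1)
f <- b0 + as.vector(X %*% beta)               # signed value of the classifier
slack <- pmax(0, 1 - y * f)                   # zero for points beyond their margin
slack              # → 0.0 1.2 0.0: only the middle point needs slack
sum(slack) <= 2    # feasible under a total-slack budget of 2
```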
The biggest value of C is reached when all the points are on the wrong side of the margin, so there is an error for every single point.
All the points on the wrong side of the margin are the ones that control the orientation of the margin. The more points involved in setting the orientation, the more stable it becomes: the higher the C, the higher the stability. This is the bias-variance tradeoff.
The SVM keeps the units as given, so we need to standardize the variables.
Feature expansion: enlarge the input space to K > p dimensions, e.g., with interactions between every combination of variables.
Nonlinear transformations get wild rather fast, so they are not an appropriate choice on their own. Even in regression we do not prefer degrees above 3, even with many variables.
Each observation in the dataset is combined with every other observation through inner products, and together with the intercept these form the support vector classifier.
Most of the alpha values will be zero. The points on the margin or on the wrong side of it have nonzero alpha values; the misclassified ones have an error term associated with them.
If we move a correctly classified point to another location still on the correct side of the margin, it does not affect the solution.
This sparsity is quite different from the sparsity of the lasso. In the lasso, coefficients are zero; here, data points (their alpha weights) are zero.
The sparsity is in the data space, not in the feature space.
The algorithm transforms the input feature space to a higher dimension and finds a linear solution in that space, which can be nonlinear in the original input space.
A kernel function is a function of two arguments, in this case two p-vectors. The inner product plus 1, raised to the power d, expands over the p-dimensional space as a degree-d polynomial. The number of basis functions is (p + d) choose d.
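The count and the kernel trick can be verified directly for d = 2 and p = 2: the kernel value \((1 + \langle x, y \rangle)^2\) equals an ordinary inner product of explicitly expanded feature vectors.

```r
poly_kernel <- function(x, y, d = 2) (1 + sum(x * y))^d
# explicit degree-2 feature map for p = 2: choose(2 + 2, 2) = 6 basis functions
phi <- function(x) c(1, sqrt(2) * x[1], sqrt(2) * x[2],
                     x[1]^2, x[2]^2, sqrt(2) * x[1] * x[2])
x <- c(1, 2); y <- c(3, -1)
poly_kernel(x, y)     # → 4, computed in the 2-dimensional input space
sum(phi(x) * phi(y))  # → 4, the same value via the 6-dimensional feature space
```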
Once again, most of the alpha values will be zero; only the support vectors, the points on or violating the margin, contribute. This is where kernels enter the support vector machine.
One of the most famous kernels is the radial basis function kernel. It corresponds to a very high-dimensional (in fact infinite-dimensional) feature space, yet the kernel computes the inner product for us.
Even though the feature space is infinite-dimensional, most of the dimensions are squashed down, mostly the wiggly ones.
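A sketch of the radial kernel itself: similarity decays with squared Euclidean distance, and gamma controls how local (and hence how wiggly) the resulting decision boundary is.

```r
rbf <- function(x, y, gamma = 1) exp(-gamma * sum((x - y)^2))
rbf(c(0, 0), c(0, 0))         # identical points: kernel value 1
rbf(c(0, 0), c(3, 4), 0.1)    # distance 5: exp(-0.1 * 25) ≈ 0.082
rbf(c(0, 0), c(3, 4), 10)     # large gamma: essentially 0, very local
```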
In polynomial kernels, when d is a million, an explicit feature expansion would get out of control, whereas the kernel, because of this squashing down, keeps things manageable. Since the SVM can do this without actually forming the features, you can raise the degree as high as you like and it will still work.
A plot of the true positive rate against the false positive rate is drawn for different values of gamma in the radial kernel. The larger the gamma, the more wiggly the decision boundary. On the training data we observe that by decreasing gamma we do worse.
Support vector machines are a powerful classification algorithm. When used in conjunction with random forests and other machine learning tools, they give a very different dimension to ensemble models. They become very useful when the number of variables is higher than the number of observations and very high predictive power is required.
A toddler asked to separate a set of 10 houses and cars made of Lego makes the distinction based on similar appearance.
PCA finds a low-dimensional representation of the data that explains a good fraction of the variance.
It is one of the most widely used tools in applied statistics.
For an n x p dataset we first center all p variables, so that the column mean of each variable is zero, and scale each standard deviation to 1.
The loading vectors (eigenvectors) are normalized so that their squared loadings sum to 1. These eigenvectors are the directions along which the data varies the most.
Then n Z values (scores) are generated; the linear combination with the highest variance is the first principal component.
The second principal component has maximum variance among all the linear combinations that are uncorrelated with the first principal component.
Being uncorrelated with the first principal component means that the second is orthogonal to the first. There can be at most min(n - 1, p) principal components.
The component explaining the largest variance is the first principal component.
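These properties can be checked with `prcomp()` on a built-in dataset: after centering and scaling, the components come out in decreasing order of variance explained.

```r
pc <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
pc$sdev^2                  # variance along each principal component
all(diff(pc$sdev^2) <= 0)  # TRUE: each PC explains no more than the previous one
ncol(pc$rotation)          # at most min(n - 1, p) = 4 components here
```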
The hyperplane passes through the middle of these points. The distance from the hyperplane to each point is calculated, and minimizing the sum of squares of those distances gives the plane closest to all the data.
The hyperplane is defined in terms of the two largest PCs, i.e., the two direction vectors that define it are those of the principal components.
In linear regression we focus on the distance in Y, which is not the perpendicular distance; in PCA we focus on the shortest (perpendicular) distance from the hyperplane.
In PCA we take all the dimensions into consideration when calculating the mean squared distance.
\(\text{Total variance} = \text{sum of the variances of all the variables}\)
The sum of the variances of all the variables is equal to the sum of the variances of all the Z scores, with the number of PCs being min(n - 1, p).
\(\text{PVE of the mth principal component} = \text{Var}(Z_m) / \text{sum of the variances of } X\)
PVE is a value between 0 and 1, and it decreases with the principal component index. The sum of the PVEs of all the PCs equals 1. The PVE decreases because each component captures the largest variance remaining after the earlier, uncorrelated components.
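Computing the PVEs from a `prcomp()` fit confirms both claims: they are nonincreasing and they sum to 1.

```r
pc  <- prcomp(iris[, 1:4], scale. = TRUE)
pve <- pc$sdev^2 / sum(pc$sdev^2)  # proportion of variance explained per PC
round(pve, 3)
sum(pve)       # exactly 1
cumsum(pve)    # running total: how many PCs are needed to reach, say, 95%?
```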
There is no simple answer to how many PCs we need, but a rough rule is: if the first few principal components explain about 95% of the variance, they are sufficient.
There is no Y variable (response) on which to conduct cross-validation here. If in regression we have a large number of variables, we can conduct a PCA and cross-validate the regression; in that case CV is useful.
If there is an elbow in the scree plot, the principal components before that elbow are sufficient to summarize the data.
K-means uses a pre-specified number of clusters.
Variables that tend to divide the clusters easily and significantly also tend to have high variance. Hence there is a significant relation between PCA and clustering.
A good clustering is one for which the within-cluster variation is as small as possible, i.e., we minimize the within-cluster variation [Euclidean distance].
The algorithm begins with a random assignment of points to the chosen number of clusters.
Specify the number of clusters.
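A minimal `kmeans()` sketch on the iris measurements; `nstart` repeats the random initial assignment and keeps the best of the runs.

```r
set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)  # pre-specified K = 3
table(km$cluster, iris$Species)  # clusters line up reasonably with species
km$tot.withinss                  # the within-cluster variation being minimized
```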
Hierarchical clustering identifies the distance between points and forms a link between the closest ones; then the next closest points are also linked.
Then progressively larger groups are clustered, and clusters are merged within the hierarchy.
Important considerations in hierarchical clustering: 1. which dissimilarity measure should be used, 2. what type of linkage should be used, 3. how many clusters to use, 4. which features to use in clustering.
Complete linkage: distance between the farthest points.
Single linkage: distance between the closest points. It tends to produce long, stringy clusters.
Average linkage: distance between all pairs of points, averaged.
Centroid linkage: the centroid of each cluster is taken and the distance between centroids is measured. This is more common in genomics.
Complete and average linkage are the most widely used.
Correlation-based distance: how close two observations are is measured in terms of the correlation between their variable profiles.
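The linkage options and the correlation-based dissimilarity map directly onto `hclust()` and `dist()`; a sketch on the iris measurements:

```r
d <- dist(scale(iris[, 1:4]))                  # Euclidean distance, scaled data
hc.complete <- hclust(d, method = "complete")  # farthest-point linkage
hc.single   <- hclust(d, method = "single")    # tends to give long, stringy clusters
hc.average  <- hclust(d, method = "average")
# correlation-based distance: observations with similar variable profiles are close
d.cor <- as.dist(1 - cor(t(iris[, 1:4])))
hc.cor <- hclust(d.cor, method = "complete")
table(cutree(hc.complete, k = 3))              # cut the dendrogram into 3 clusters
```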